Review: Tools and Programming

Lecture 1 - Basics

Working directory

First check your current working directory with: getwd()

To set a new working directory, use the following command: setwd("/Users/alexander/Documents/Master Kiel/Tools and Programming Languages")

Simple mathematical operations

[1] 4
[1] 2
[1] 2
[1] 12
[1] 100
[1] 10

Inspect your objects

View(e) (run this interactively in the RStudio environment)

Inspect your first dataset

  Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1          5.1         3.5          1.4         0.2  setosa
2          4.9         3.0          1.4         0.2  setosa
3          4.7         3.2          1.3         0.2  setosa
4          4.6         3.1          1.5         0.2  setosa
5          5.0         3.6          1.4         0.2  setosa
6          5.4         3.9          1.7         0.4  setosa
[1] "data.frame"
[1] "Sepal.Length" "Sepal.Width"  "Petal.Length" "Petal.Width" 
[5] "Species"     
'data.frame':   150 obs. of  5 variables:
 $ Sepal.Length: num  5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
 $ Sepal.Width : num  3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...
 $ Petal.Length: num  1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ...
 $ Petal.Width : num  0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...
 $ Species     : Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 1 1 1 1 1 ...
[1] "setosa"     "versicolor" "virginica" 
[1] "list"

$names
[1] "Sepal.Length" "Sepal.Width"  "Petal.Length" "Petal.Width" 
[5] "Species"     

$class
[1] "data.frame"

$row.names
 [1]  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23
[24] 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46
[47] 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69
[70] 70 71 72 73 74 75
 [ reached getOption("max.print") -- omitted 75 entries ]

Why df has class = data frame and type = list

‘mode/type’ is a mutually exclusive classification of objects according to their basic structure. The ‘atomic’ modes are numeric, complex, character and logical. Recursive objects have modes such as ‘list’ or ‘function’ or a few others. An object has one and only one mode.

‘class’ is a property assigned to an object that determines how generic functions operate with it. It is not a mutually exclusive classification. If an object has no specific class assigned to it, such as a simple numeric vector, its class is usually the same as its mode, by convention.
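To make the distinction concrete, here is a minimal sketch using the built-in iris data frame. The helper vector x is only an illustration and is not part of the lecture material.

```r
# iris ships with base R (datasets package)
class(iris)    # "data.frame": decides which methods generics such as summary() use
typeof(iris)   # "list": the underlying storage structure

# A plain numeric vector has no class attribute,
# so class() falls back to an implicit class
x <- c(1.5, 2.5)     # illustrative vector
attr(x, "class")     # NULL
class(x)             # "numeric"
typeof(x)            # "double"
```

So a data frame is stored as a list, and only its class attribute makes R treat it as a table.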

Install a package

install.packages("dplyr") (run this interactively in the RStudio environment)

First manipulation of a data frame

First error and the help function

[1] NA

help(sum) (run this interactively in the RStudio environment). The default is sum(..., na.rm = FALSE). Set na.rm = TRUE so that missing values are ignored:

[1] 36
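The original input is not shown, so the following is only a sketch with an assumed vector x that reproduces the two results above (NA first, then 36).

```r
x <- c(1:8, NA)        # assumed example vector containing one missing value

sum(x)                 # NA: a single missing value makes the whole sum unknown
sum(x, na.rm = TRUE)   # 36: missing values are dropped before summing
```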

Lecture 2 - Basics

Insert data frame from the package or from the working directory

df <- read.csv("name.csv")

Exploring the data

[1] "data.frame"
[1] "list"
'data.frame':   32 obs. of  11 variables:
 $ mpg : num  21 21 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 ...
 $ cyl : num  6 6 4 6 8 6 8 4 4 6 ...
 $ disp: num  160 160 108 258 360 ...
 $ hp  : num  110 110 93 110 175 105 245 62 95 123 ...
 $ drat: num  3.9 3.9 3.85 3.08 3.15 2.76 3.21 3.69 3.92 3.92 ...
 $ wt  : num  2.62 2.88 2.32 3.21 3.44 ...
 $ qsec: num  16.5 17 18.6 19.4 17 ...
 $ vs  : num  0 0 1 1 0 1 0 1 1 1 ...
 $ am  : num  1 1 1 0 0 0 0 0 0 0 ...
 $ gear: num  4 4 4 3 3 3 3 4 4 4 ...
 $ carb: num  4 4 1 1 2 1 4 2 2 4 ...
[1] 32
[1] 11
[1] 32 11
                   mpg cyl disp  hp drat    wt  qsec vs am gear carb
Mazda RX4         21.0   6  160 110 3.90 2.620 16.46  0  1    4    4
Mazda RX4 Wag     21.0   6  160 110 3.90 2.875 17.02  0  1    4    4
Datsun 710        22.8   4  108  93 3.85 2.320 18.61  1  1    4    1
Hornet 4 Drive    21.4   6  258 110 3.08 3.215 19.44  1  0    3    1
Hornet Sportabout 18.7   8  360 175 3.15 3.440 17.02  0  0    3    2
Valiant           18.1   6  225 105 2.76 3.460 20.22  1  0    3    1
[1] 32
[1] 11
[1] 32 11
                mpg cyl  disp  hp drat    wt qsec vs am gear carb
Lotus Europa   30.4   4  95.1 113 3.77 1.513 16.9  1  1    5    2
Ford Pantera L 15.8   8 351.0 264 4.22 3.170 14.5  0  1    5    4
Ferrari Dino   19.7   6 145.0 175 3.62 2.770 15.5  0  1    5    6
Maserati Bora  15.0   8 301.0 335 3.54 3.570 14.6  0  1    5    8
Volvo 142E     21.4   4 121.0 109 4.11 2.780 18.6  1  1    4    2
                   mpg cyl disp  hp drat    wt  qsec vs am gear carb
Mazda RX4         21.0   6  160 110 3.90 2.620 16.46  0  1    4    4
Mazda RX4 Wag     21.0   6  160 110 3.90 2.875 17.02  0  1    4    4
Datsun 710        22.8   4  108  93 3.85 2.320 18.61  1  1    4    1
Hornet 4 Drive    21.4   6  258 110 3.08 3.215 19.44  1  0    3    1
Hornet Sportabout 18.7   8  360 175 3.15 3.440 17.02  0  0    3    2
$names
 [1] "mpg"  "cyl"  "disp" "hp"   "drat" "wt"   "qsec" "vs"   "am"   "gear"
[11] "carb"

$row.names
 [1] "Mazda RX4"           "Mazda RX4 Wag"       "Datsun 710"         
 [4] "Hornet 4 Drive"      "Hornet Sportabout"   "Valiant"            
 [7] "Duster 360"          "Merc 240D"           "Merc 230"           
[10] "Merc 280"            "Merc 280C"           "Merc 450SE"         
[13] "Merc 450SL"          "Merc 450SLC"         "Cadillac Fleetwood" 
[16] "Lincoln Continental" "Chrysler Imperial"   "Fiat 128"           
[19] "Honda Civic"         "Toyota Corolla"      "Toyota Corona"      
[22] "Dodge Challenger"    "AMC Javelin"         "Camaro Z28"         
[25] "Pontiac Firebird"    "Fiat X1-9"           "Porsche 914-2"      
[28] "Lotus Europa"        "Ford Pantera L"      "Ferrari Dino"       
[31] "Maserati Bora"       "Volvo 142E"         

$class
[1] "data.frame"
 [1] "mpg"  "cyl"  "disp" "hp"   "drat" "wt"   "qsec" "vs"   "am"   "gear"
[11] "carb"
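The commands behind this output are not shown. The following is a plausible sketch of exploration calls for the built-in mtcars data set that yields output of this kind. The exact calls used in the lecture are an assumption.

```r
df <- mtcars    # built-in data set from the datasets package

class(df)       # "data.frame"
typeof(df)      # "list"
str(df)         # compact overview of all columns
nrow(df)        # number of rows
ncol(df)        # number of columns
dim(df)         # rows and columns at once
head(df)        # first 6 rows (default)
tail(df, 5)     # last 5 rows
head(df, 5)     # first 5 rows
attributes(df)  # names, row.names and class
names(df)       # column names only
```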

Subsetting the data

 [1] 21.0 21.0 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 17.8 16.4 17.3 15.2
[15] 10.4 10.4 14.7 32.4 30.4 33.9 21.5 15.5 15.2 13.3 19.2 27.3 26.0 30.4
[29] 15.8 19.7 15.0 21.4
                   mpg cyl disp  hp drat    wt  qsec vs am gear carb
Mazda RX4         21.0   6  160 110 3.90 2.620 16.46  0  1    4    4
Mazda RX4 Wag     21.0   6  160 110 3.90 2.875 17.02  0  1    4    4
Datsun 710        22.8   4  108  93 3.85 2.320 18.61  1  1    4    1
Hornet 4 Drive    21.4   6  258 110 3.08 3.215 19.44  1  0    3    1
Hornet Sportabout 18.7   8  360 175 3.15 3.440 17.02  0  0    3    2
                   mpg cyl disp  hp drat   wt  qsec vs am gear carb
Mazda RX4         21.0   6  160 110 3.90 2.62 16.46  0  1    4    4
Hornet Sportabout 18.7   8  360 175 3.15 3.44 17.02  0  0    3    2
                     mpg cyl  disp
Mazda RX4           21.0   6 160.0
Mazda RX4 Wag       21.0   6 160.0
Datsun 710          22.8   4 108.0
Hornet 4 Drive      21.4   6 258.0
Hornet Sportabout   18.7   8 360.0
Valiant             18.1   6 225.0
Duster 360          14.3   8 360.0
Merc 240D           24.4   4 146.7
Merc 230            22.8   4 140.8
Merc 280            19.2   6 167.6
Merc 280C           17.8   6 167.6
Merc 450SE          16.4   8 275.8
Merc 450SL          17.3   8 275.8
Merc 450SLC         15.2   8 275.8
Cadillac Fleetwood  10.4   8 472.0
Lincoln Continental 10.4   8 460.0
Chrysler Imperial   14.7   8 440.0
Fiat 128            32.4   4  78.7
Honda Civic         30.4   4  75.7
Toyota Corolla      33.9   4  71.1
Toyota Corona       21.5   4 120.1
Dodge Challenger    15.5   8 318.0
AMC Javelin         15.2   8 304.0
Camaro Z28          13.3   8 350.0
Pontiac Firebird    19.2   8 400.0
 [ reached 'max' / getOption("max.print") -- omitted 7 rows ]
                     mpg  hp
Mazda RX4           21.0 110
Mazda RX4 Wag       21.0 110
Datsun 710          22.8  93
Hornet 4 Drive      21.4 110
Hornet Sportabout   18.7 175
Valiant             18.1 105
Duster 360          14.3 245
Merc 240D           24.4  62
Merc 230            22.8  95
Merc 280            19.2 123
Merc 280C           17.8 123
Merc 450SE          16.4 180
Merc 450SL          17.3 180
Merc 450SLC         15.2 180
Cadillac Fleetwood  10.4 205
Lincoln Continental 10.4 215
Chrysler Imperial   14.7 230
Fiat 128            32.4  66
Honda Civic         30.4  52
Toyota Corolla      33.9  65
Toyota Corona       21.5  97
Dodge Challenger    15.5 150
AMC Javelin         15.2 150
Camaro Z28          13.3 245
Pontiac Firebird    19.2 175
Fiat X1-9           27.3  66
Porsche 914-2       26.0  91
Lotus Europa        30.4 113
Ford Pantera L      15.8 264
Ferrari Dino        19.7 175
Maserati Bora       15.0 335
Volvo 142E          21.4 109

Lecture 3 - Basics

Data types

Vectors constitute the most important family of data types in R. There are two fundamentally different types of vectors, atomic vectors and lists. Atomic vectors have homogeneous element types, i.e. they cannot mix numbers, characters and logical values, whereas lists can have heterogeneous element types.

Atomic vectors

[1] "data"
[1] 1
[1] 2.1
[1] TRUE
[1] "data"    "science" "rocks"  
[1]   1  10 100
[1]   2.100000   3.141593 100.000000
[1]  TRUE FALSE FALSE
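A sketch of how the four vectors shown above could be created. The names char, int, dbl and lgl follow Task 1 below, and the values are taken from the printed output.

```r
char <- c("data", "science", "rocks")   # character vector
int  <- c(1L, 10L, 100L)                # integer vector (note the L suffix)
dbl  <- c(2.1, pi, 100)                 # double vector
lgl  <- c(TRUE, FALSE, FALSE)           # logical vector

# A single value is simply a vector of length 1
length("data")   # 1

char
int
dbl
lgl
```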

Task 1: Confirm the internal storage type of the atomic vectors char, int, dbl, lgl using the function typeof. Use the class function to check their classes.

[1] "character"
[1] "character"

Type coercion

What happens if we combine different types in one vector? R automatically coerces all elements to the more flexible type.

[1] "data" "1"    "10"   "100" 
[1] "character"
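A minimal sketch that reproduces the coercion shown above. The name mixed is only illustrative.

```r
# Combining a character value with integers: everything becomes character
mixed <- c("data", 1L, 10L, 100L)

mixed           # "data" "1" "10" "100"
typeof(mixed)   # "character"
```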

Not everything is coerced as you might wish: functions may require specific input types:

[1] 2
[1] 2
Error in sum(char): invalid 'type' (character) of argument
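The inputs behind these results are not shown. As a sketch with assumed example vectors: sum accepts logical and numeric input but refuses character input, even if the strings look like numbers.

```r
lgl  <- c(TRUE, TRUE, FALSE)   # assumed example values
char <- c("1", "1")            # assumed example values

sum(lgl)                # 2: TRUE counts as 1, FALSE as 0
sum(as.integer(lgl))    # 2: the same after explicit conversion

# sum() on character input raises an error; tryCatch keeps the script running
res <- tryCatch(sum(char), error = function(e) conditionMessage(e))
res                     # the error message as a string

sum(as.numeric(char))   # 2: works after explicit conversion
```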

Task 2: Rank the 4 common atomic vector types from most to least flexible. To answer this question, create different combinations of the atomic vectors int, dbl, lgl, and char using the c() function and evaluate the type of the combined vector via typeof().


Factor, date, and datetime objects

There are several other vector classes built on top of atomic vectors. The most important ones are factor, Date and the datetime class POSIXct. They are stored as atomic vectors with attached attributes. They all have special properties that are helpful in practice. For instance, the Date class enables us to calculate with dates, sort dates, and print dates in a readable way.

Factors

Factors are built on top of integer vectors. They are stored as integers with attached labels. Factors are often useful for representing ordered or unordered categorical data.

Let’s turn a character vector of clothing sizes sold in a shop into a factor:

[1] XXL S   M   S   L   M   S   XXL
Levels: S M L XXL
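A sketch of the conversion, using the vector names sizes_char and sizes_factor from the tasks below. The values and the level order are read off the printed output.

```r
sizes_char <- c("XXL", "S", "M", "S", "L", "M", "S", "XXL")

# Passing levels explicitly fixes their order; otherwise R sorts alphabetically
sizes_factor <- factor(sizes_char, levels = c("S", "M", "L", "XXL"))

sizes_factor               # prints the labels and the levels
as.integer(sizes_factor)   # 4 1 2 1 3 2 1 4: the integer codes behind the labels
```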

Task 3: Let’s try to understand better what a factor is. Apply the functions class, typeof, str, and attributes on sizes_factor and observe the output.

[1] "factor"
[1] "integer"
 Factor w/ 4 levels "S","M","L","XXL": 4 1 2 1 3 2 1 4
$levels
[1] "S"   "M"   "L"   "XXL"

$class
[1] "factor"
  • class(): class of the object
  • typeof(): internal storage type of the object
  • attributes(): list of the object’s attributes
  • str(): compact display of the object’s internal structure

Task 4: To see why factor variables are sometimes useful, compare the output of the summary function for the vectors sizes_char and sizes_factor. Which output is more useful?

   Length     Class      Mode 
        8 character character 
  S   M   L XXL 
  3   2   1   2 

Date

The Date class represents dates as the number of days since 1970-01-01 and internally stores them as a double vector. This enables us to sort dates and calculate with dates: add, subtract, create date sequences, etc.

[1] "2019-12-13"
[1] "2019-12-14"
[1] "2019-12-13" "2020-12-13" "2021-12-13" "2022-12-13" "2023-12-13"

Let’s confirm that 1970-01-01 is day 0 for the R Date class

[1] 0

How many days have passed since day 0?

Time difference of 18243 days
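The results above can be reproduced with a few lines. This is a sketch. The variable name d is an assumption.

```r
d <- as.Date("2019-12-13")

d                                     # "2019-12-13"
d + 1                                 # adding 1 moves one day ahead
seq(d, by = "year", length.out = 5)   # yearly sequence of 5 dates

typeof(d)                             # "double": the internal storage type
as.numeric(as.Date("1970-01-01"))     # 0: day zero of the Date class
d - as.Date("1970-01-01")             # Time difference of 18243 days
```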

Task 5: How are dates prior to 1970-01-01 stored? Let’s try 1969-12-31.


Datetime (POSIXct)

The POSIXct class represents datetimes as the number of seconds since 1970-01-01 00:00:00 UTC and internally stores them as a double vector. This enables us to sort datetimes and calculate with datetimes: add, subtract, create datetime sequences, etc.

[1] "2019-12-13 13:41:51 CET"
[1] "2019-12-13 13:41:51 CET" "2019-12-23 13:41:51 CET"
[3] "2020-01-02 13:41:51 CET" "2020-01-12 13:41:51 CET"
[5] "2020-01-22 13:41:51 CET"
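A sketch of how such datetimes can be created. To keep the result independent of the local machine, the example fixes the time zone to UTC, which is an assumption; the output above was printed in CET.

```r
t0 <- as.POSIXct("2019-12-13 13:41:51", tz = "UTC")

t0
typeof(t0)       # "double"
as.numeric(t0)   # seconds since 1970-01-01 00:00:00 UTC

# Sequence of 5 datetimes in steps of 10 days
s <- seq(t0, by = "10 days", length.out = 5)
s

as.numeric(as.POSIXct("1970-01-01 00:00:00", tz = "UTC"))   # 0
```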

Task 6: Check which time zone is displayed above. Is it CET (Central European Time), GMT/UTC (Greenwich Mean Time), or some other time zone? If you want to change your time zone setting, consult help(timezones).


Matrices and arrays

All atomic vectors can be turned into a matrix (2-dimensional) or an array (multi-dimensional) via a dimension attribute. Internally, matrices and arrays are stored as atomic vectors, but R treats them differently. If you apply the same function to an atomic vector, a matrix and an array, the results can differ.

Let’s create an integer vector and convert it into a matrix and an array.

Let’s check how matrices and arrays are printed

     [,1] [,2] [,3] [,4] [,5] [,6]
[1,]    1    3    5    7    9   11
[2,]    2    4    6    8   10   12
, , 1

     [,1] [,2] [,3]
[1,]    1    3    5
[2,]    2    4    6

, , 2

     [,1] [,2] [,3]
[1,]    7    9   11
[2,]    8   10   12

Let’s analyse the nature of matrices and arrays as compared to the integer vector using the functions typeof, class and dim

[1] "integer" "matrix"  "array"  
[1] "integer" "integer" "integer"
[[1]]
NULL

[[2]]
[1] 2 6

[[3]]
[1] 2 3 2
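A sketch that reproduces the objects above. The names v, m and a are assumptions; the dimensions are read off the printed output.

```r
v <- 1:12                         # integer vector
m <- matrix(v, nrow = 2)          # 2 x 6 matrix, filled column by column
a <- array(v, dim = c(2, 3, 2))   # 2 x 3 x 2 array

m
a

sapply(list(v, m, a), typeof)   # all "integer": the storage type is unchanged
lapply(list(v, m, a), dim)      # NULL, then 2 6, then 2 3 2
class(m)                        # "matrix" before R 4.0.0, c("matrix", "array") since
```

Note that class(m) returns two values in R 4.0.0 and later, so the class output above comes from an older R version.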

Task 7: Above we turned a vector into a matrix and an array using the dedicated functions matrix and array. But there’s another way to achieve this: by setting the dimension attribute (dim). Try several 2- or 3-dimensional combinations. Observe how the printing and the class change.

     [,1] [,2] [,3] [,4] [,5]
[1,] "a"  "e"  "i"  "m"  "q" 
[2,] "b"  "f"  "j"  "n"  "r" 
[3,] "c"  "g"  "k"  "o"  "s" 
[4,] "d"  "h"  "l"  "p"  "t" 
[1] "matrix"
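One possible solution sketch for Task 7, assuming the first 20 lowercase letters as input (which matches the printed matrix). The names x, m and a are illustrative.

```r
x <- letters[1:20]    # plain character vector, no dim attribute

m <- x
dim(m) <- c(4, 5)     # 4 rows, 5 columns: now printed as a matrix
m
is.matrix(m)          # TRUE

a <- x
dim(a) <- c(2, 5, 2)  # three dimensions: now a 3-d array
is.matrix(a)          # FALSE
is.array(a)           # TRUE
```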

List

Lists are 1-dimensional objects, just like atomic vectors. Unlike atomic vectors, however, the elements of a list can be heterogeneous. A list can combine vectors with data frames, arrays and any other R object (functions, formulas, etc.).

Let’s create a simple list containing 3 different vector classes (integer, character, Date) and inspect how the list is printed.

$numbers
 [1]  1  2  3  4  5  6  7  8  9 10

$letters
 [1] "A" "B" "C" "D" "E" "F" "G" "H" "I" "J"

$dates
 [1] "2019-12-13" "2019-12-14" "2019-12-15" "2019-12-16" "2019-12-17"
 [6] "2019-12-18" "2019-12-19" "2019-12-20" "2019-12-21" "2019-12-22"
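A sketch of how this list can be built. The object name simple_list is an assumption; the element names and values are taken from the printed output.

```r
simple_list <- list(
  numbers = 1:10,                                  # integer vector
  letters = LETTERS[1:10],                         # character vector
  dates   = seq(as.Date("2019-12-13"),             # Date vector,
                by = "day", length.out = 10)       # 10 consecutive days
)

simple_list
length(simple_list)          # 3 elements
sapply(simple_list, class)   # one class per element
```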

Lists are sometimes called recursive vectors, because lists can be nested within lists to any depth. As an example, consider the (simplified) representation of the order of Primates.


Task 8: To see why nested lists can be useful, install the data.tree package and run the code chunk below for a plot of the greatly simplified phylogenetic tree of primates.


Data frame

Data frames are fundamental to data analysis and machine learning. Data frames are 2-dimensional like matrices, but can combine heterogeneous data types across columns. In terms of structure, a data frame is essentially a list of equal-length vectors with attributes for the column names (names), row names (row.names) and its class (data.frame).

Since the simple list from above consists of equal length vectors we can convert this list into a data frame:

   numbers letters      dates
1        1       A 2019-12-13
2        2       B 2019-12-14
3        3       C 2019-12-15
4        4       D 2019-12-16
5        5       E 2019-12-17
6        6       F 2019-12-18
7        7       G 2019-12-19
8        8       H 2019-12-20
9        9       I 2019-12-21
10      10       J 2019-12-22
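A sketch of the conversion. The list is rebuilt here so that the example runs on its own, and the name simple_list is an assumption.

```r
simple_list <- list(
  numbers = 1:10,
  letters = LETTERS[1:10],
  dates   = seq(as.Date("2019-12-13"), by = "day", length.out = 10)
)

# Works because all list elements have the same length
df <- as.data.frame(simple_list)

df
dim(df)      # 10 rows, 3 columns
typeof(df)   # "list": the list structure is still there
```

In R versions before 4.0.0 the letters column is converted to a factor by default, as in the str output below. From R 4.0.0 on it stays a character vector.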

Task 9: To better understand the nature of data frames apply the functions class, typeof, attributes and str on the data frame df.

[1] "data.frame"
[1] "list"
$names
[1] "numbers" "letters" "dates"  

$class
[1] "data.frame"

$row.names
 [1]  1  2  3  4  5  6  7  8  9 10
'data.frame':   10 obs. of  3 variables:
 $ numbers: int  1 2 3 4 5 6 7 8 9 10
 $ letters: Factor w/ 10 levels "A","B","C","D",..: 1 2 3 4 5 6 7 8 9 10
 $ dates  : Date, format: "2019-12-13" "2019-12-14" ...

Task 10: Attributes can be changed. First, change the column names by assigning c("Zahlen", "Buchstaben", "Daten") to the names attribute. Second, check whether changing the class attribute to “list” actually suffices to create a list.


Generic functions

An important family of R functions is called S3 generic functions. Examples of generic functions are summary, print, plot, and mean. Generic functions interact with the class attribute of the function’s first argument in a special way. Depending on the class, generic functions will do different things.

Let’s apply the generic function summary to date, factor, and character vectors.

        Min.      1st Qu.       Median         Mean      3rd Qu. 
"2019-12-13" "2020-12-13" "2021-12-13" "2021-12-12" "2022-12-13" 
        Max. 
"2023-12-13" 
  S   M   L XXL 
  3   2   1   2 
   Length     Class      Mode 
        8 character character 

Notice that the summary function performs three different summary operations for the three different classes of objects. Let’s focus on one of the objects, the vector sizes_factor, and illustrate what the summary function is doing under the hood:

  1. R checks the class of the vector sizes_factor
  2. R checks whether there is a dedicated summary method for the factor class (summary.factor)
  3. If yes, the summary.factor method is applied. If no, the summary.default method is applied
[1] "factor"
[1] TRUE
  S   M   L XXL 
  3   2   1   2 
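The three steps can be traced by hand. This is a sketch and not necessarily the exact calls used in the lecture; sizes_factor is rebuilt so that the example runs on its own.

```r
sizes_factor <- factor(c("XXL", "S", "M", "S", "L", "M", "S", "XXL"),
                       levels = c("S", "M", "L", "XXL"))

class(sizes_factor)            # step 1: "factor"
exists("summary.factor")       # step 2: TRUE, a dedicated method exists
summary.factor(sizes_factor)   # step 3: calling the method directly

# The generic dispatches to the same method
identical(summary(sizes_factor), summary.factor(sizes_factor))   # TRUE
```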

Task 11: Use the methods command to get a list of all methods that can be invoked by the generic functions summary and mean.

[1] mean.Date        mean.default     mean.difftime    mean.POSIXct    
[5] mean.POSIXlt     mean.quosure*    mean.vctrs_vctr*
see '?methods' for accessing help and source code
 [1] summary.aov                    summary.aovlist*              
 [3] summary.aspell*                summary.check_packages_in_dir*
 [5] summary.cohesiveBlocks*        summary.connection            
 [7] summary.data.frame             summary.Date                  
 [9] summary.default                summary.ecdf*                 
[11] summary.factor                 summary.gexf*                 
[13] summary.ggplot*                summary.glm                   
[15] summary.hcl_palettes*          summary.igraph*               
[17] summary.infl*                  summary.lm                    
[19] summary.loess*                 summary.manova                
[21] summary.matrix                 summary.mlm*                  
[23] summary.nls*                   summary.packageStatus*        
[25] summary.POSIXct                summary.POSIXlt               
[27] summary.ppr*                   summary.prcomp*               
[29] summary.princomp*              summary.proc_time             
[31] summary.rlang_error*           summary.rlang_trace*          
[33] summary.srcfile                summary.srcref                
[35] summary.stepfun                summary.stl*                  
[37] summary.table                  summary.tukeysmooth*          
[39] summary.vctrs_sclr*            summary.vctrs_vctr*           
[41] summary.warnings               summary.XMLInternalDocument*  
see '?methods' for accessing help and source code

Selecting parts of an object

We often wish to select parts of an object for the purpose of extraction or replacement. There are three main operators to choose from: [, [[ and $.

Multiple elements

The operator [ allows extracting any element or any combination of elements of an R object. Within the square brackets we specify along which dimensions we want to select, writing [dim1, dim2, ...]. In case of matrices and data frames this is [row, col].

In general, the dimension arguments within the square brackets can take three different forms:

  1. Numeric vector
  2. Logical vector
  3. Character vector
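A short sketch of the three index forms on the built-in mtcars data frame. The variable names are illustrative.

```r
# 1. Numeric vector: rows by position
by_position <- mtcars[c(1, 5), ]

# 2. Logical vector: rows where the condition is TRUE
four_cyl <- mtcars[mtcars$cyl == 4, ]

# 3. Character vector: columns by name
by_name <- mtcars[, c("mpg", "hp")]

# The forms can be mixed across dimensions: [row, col]
mtcars[c(1, 5), c("mpg", "hp")]
```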

Special cases

Negative selection

We can invert a selection using the - operator for numeric vectors and using the ! operator for logical vectors.

  a   c 
100 300 
b d e 
2 4 5 
  a   c 
100 300 
b d e 
2 4 5 
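A sketch that reproduces the four results above. The vector x and the threshold 50 are assumptions chosen to match the printed values.

```r
x <- c(a = 100, b = 2, c = 300, d = 4, e = 5)   # assumed named vector

x[c(1, 3)]      # positive numeric selection: a and c
x[-c(1, 3)]     # negative numeric selection: everything except a and c

x[x > 50]       # logical selection: a and c
x[!(x > 50)]    # inverted logical selection: b, d and e
```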

Single elements

There are two other important operators: [[ and $. They are useful in the context of lists and data frames (which are internally stored as lists) when we want to select only 1 element, e.g. one column of a data frame. While the [ operator preserves the list structure, [[ and $ enable us to navigate into the list structure.

Let’s consider our simple list from above:

$dates
 [1] "2019-12-13" "2019-12-14" "2019-12-15" "2019-12-16" "2019-12-17"
 [6] "2019-12-18" "2019-12-19" "2019-12-20" "2019-12-21" "2019-12-22"

 [1] "2019-12-13" "2019-12-14" "2019-12-15" "2019-12-16" "2019-12-17"
 [6] "2019-12-18" "2019-12-19" "2019-12-20" "2019-12-21" "2019-12-22"
 [1] "2019-12-13" "2019-12-14" "2019-12-15" "2019-12-16" "2019-12-17"
 [6] "2019-12-18" "2019-12-19" "2019-12-20" "2019-12-21" "2019-12-22"
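The three results above correspond to the three operators. This is a sketch; the list is rebuilt so that the example runs on its own, and the name simple_list is an assumption.

```r
simple_list <- list(
  numbers = 1:10,
  letters = LETTERS[1:10],
  dates   = seq(as.Date("2019-12-13"), by = "day", length.out = 10)
)

simple_list["dates"]     # [ keeps the list structure: a list of length 1
simple_list[["dates"]]   # [[ returns the element itself: a Date vector
simple_list$dates        # $ is shorthand for [[ with a literal name
```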

The advantage becomes apparent for nested lists. Unlike [, the operators [[ and $ allow navigating deeply into a nested list and extracting or replacing elements there.


Task 12: Use the $ operator multiple times to navigate through the nested list primates until you reach humans. Don’t type everything by hand. Use autocompletion by pressing tab after each $ operator.

[1] "Humans"

Given that $ is less verbose and easier to read, why and when should we use [[ instead? Answer: In situations where we want to pass the selection as a variable.

List of 3
 $ numbers: int [1:10] 1 2 3 4 5 6 7 8 9 10
 $ letters: chr [1:10] "A" "B" "C" "D" ...
 $ dates  : Date[1:10], format: "2019-12-13" "2019-12-14" ...
 [1]  1  2  3  4  5  6  7  8  9 10
NULL
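A sketch of this situation. The names simple_list and var are assumptions; the results match the output above.

```r
simple_list <- list(
  numbers = 1:10,
  letters = LETTERS[1:10],
  dates   = seq(as.Date("2019-12-13"), by = "day", length.out = 10)
)

var <- "numbers"     # the element name is stored in a variable

str(simple_list)
simple_list[[var]]   # 1 2 ... 10: [[ evaluates var and looks up "numbers"
simple_list$var      # NULL: $ looks for an element literally called "var"
```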

Application

Data frame

Now, let’s focus on data frames and on some applications of extraction and replacement that we often see in practice. Note that we will cover packages like dplyr or data.table later in class, which make subsetting data frames much more convenient.

 [1] FALSE FALSE FALSE FALSE FALSE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE
[12]  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE
 [1]  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20
[1] 1 2 3 4 5

Task 13: The iris data frame contains information for 150 flowers. First, extract all 12 observations with sepal length larger than 7. Second, for these flowers only select the Species column. Third, change the species names of all flowers to capital letters using the toupper function. Fourth, create an additional variable sepal_length_100 by multiplying Sepal.Length by a factor of 100.

    Sepal.Length Sepal.Width Petal.Length Petal.Width   Species
103          7.1           3          5.9         2.1 virginica
106          7.6           3          6.6         2.1 virginica
    Sepal.Length Sepal.Width Petal.Length Petal.Width   Species
103          7.1         3.0          5.9         2.1 virginica
106          7.6         3.0          6.6         2.1 virginica
108          7.3         2.9          6.3         1.8 virginica
110          7.2         3.6          6.1         2.5 virginica
118          7.7         3.8          6.7         2.2 virginica
119          7.7         2.6          6.9         2.3 virginica
123          7.7         2.8          6.7         2.0 virginica
126          7.2         3.2          6.0         1.8 virginica
130          7.2         3.0          5.8         1.6 virginica
131          7.4         2.8          6.1         1.9 virginica
132          7.9         3.8          6.4         2.0 virginica
136          7.7         3.0          6.1         2.3 virginica
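One possible solution sketch for Task 13. The names long_sepal and iris_mod are assumptions; the sketch works on a copy so that the built-in iris data stays untouched.

```r
# 1. Rows with sepal length larger than 7
long_sepal <- iris[iris$Sepal.Length > 7, ]
nrow(long_sepal)     # 12

# 2. Only the Species column of these flowers
long_sepal$Species

# 3. Species names in capital letters (Species is a factor, so convert first)
iris_mod <- iris
iris_mod$Species <- toupper(as.character(iris_mod$Species))

# 4. New column: sepal length multiplied by 100
iris_mod$sepal_length_100 <- iris_mod$Sepal.Length * 100
head(iris_mod)
```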

Task 14: In supervised machine learning it’s common practice to split the data randomly into a training set (70% of the observations) and a test set (30%). Perform this split for the iris data frame, and use negative selection to create the test data.
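One possible solution sketch for Task 14. The seed value and the names train_idx, train and test are assumptions.

```r
set.seed(42)                      # arbitrary seed, makes the split reproducible

n <- nrow(iris)
train_idx <- sample(n, size = round(0.7 * n))   # 70% of the row numbers, drawn at random

train <- iris[train_idx, ]        # positive selection: training rows
test  <- iris[-train_idx, ]       # negative selection: all remaining rows

nrow(train)   # 105
nrow(test)    # 45
```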

Loops and their alternatives

Instead of using loops it is often more R-like to use apply (base R) or map (purrr package). However, using a vectorized function like + or *, if available, is the preferred solution because the code is more performant.

As an example, consider the following simple_vector. First, we want to multiply each element of this vector by a factor of 2 using a for loop:

 [1]   2   4   6   8  10  12  14  16  18  20  22  24  26  28  30  32  34
[18]  36  38  40  42  44  46  48  50  52  54  56  58  60  62  64  66  68
[35]  70  72  74  76  78  80  82  84  86  88  90  92  94  96  98 100 102
[52] 104 106 108 110 112 114 116 118 120 122 124 126 128 130 132 134 136
[69] 138 140 142 144 146 148 150
 [ reached getOption("max.print") -- omitted 25 entries ]

Now, let’s reformulate this operation using a member of the apply family of functions:


Task: Reformulate this operation using the vectorized * function. This is the most efficient and syntactically easiest way to do it:
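The three variants can be sketched side by side. simple_vector is assumed to be 1:100, which matches the printed result; the names res_loop, res_apply and res_vec are illustrative.

```r
simple_vector <- 1:100

# 1. for loop: pre-allocate the result, then fill it element by element
res_loop <- numeric(length(simple_vector))
for (i in seq_along(simple_vector)) {
  res_loop[i] <- simple_vector[i] * 2
}

# 2. apply family: sapply calls the function once per element
res_apply <- sapply(simple_vector, function(x) x * 2)

# 3. vectorized: * operates on the whole vector at once
res_vec <- simple_vector * 2

all(res_loop == res_vec)    # TRUE
all(res_apply == res_vec)   # TRUE
```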

Get data in and out


Task: Use one of the data frames of the datasets package to experiment with importing and exporting, and to practice some of the other aspects of base R covered before. Experiment with subsetting, extract or change columns of the data frame, inspect or change (coerce) the data types of columns, etc. Then write the data to .csv, .RData and .Rds files. Clear the environment and import the data back again.
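A sketch of such a round trip with base R functions. It writes to temporary files; the choice of mtcars and all object and file names are assumptions.

```r
df <- head(mtcars, 10)                       # small example data frame

csv_file   <- tempfile(fileext = ".csv")
rds_file   <- tempfile(fileext = ".Rds")
rdata_file <- tempfile(fileext = ".RData")

write.csv(df, csv_file)                      # text format, row names become the first column
saveRDS(df, rds_file)                        # one object, stored without its name
save(df, file = rdata_file)                  # one or more objects, stored with their names

rm(df)                                       # remove the object from the environment

df_csv <- read.csv(csv_file, row.names = 1)  # use the first column as row names again
df_rds <- readRDS(rds_file)                  # assign to any name you like
load(rdata_file)                             # restores the object under its old name df
```

readRDS returns the object so that you choose the name yourself, whereas load recreates the objects under the names they were saved with.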

Alexander Kleine

2019-12-13